Stylometric Analysis of Bloggers' Age and Gender
Authors
Abstract
We report stylometric differences in blogging across gender and age groups. The results are based on two mutually independent features. The first is the use of slang words, a new feature that we propose for the stylometric study of bloggers. Slang here means a non-dictionary word that has evolved over time through frequent and popular usage. The second is the variation in average sentence length across age groups and genders. These two features are then combined with features reported in earlier stylometric studies of age and gender. The combined feature list improves the accuracy of age and gender prediction considerably. The experiments were run on a corpus of 20,000 blogs. The results show that these features work well for detecting bloggers' demographics. Gender determination is more accurate than age-group detection when the data are spread across all ages, but the accuracy of age prediction increases when the sampled age groups are far apart.
Introduction
Gender and age are the demographic features most commonly used in stylometric experiments, because blogs generally contain this information, provided by the author. Writing style results from the writer's subconscious habit of choosing one form over the other available options for presenting the same thing. Variation also evolves with the usage of the language in a given period, genre, or situation, or by particular individuals. Variation is of two types: variation within a norm, which is grammatically correct, and deviation from the norm, which is ungrammatical. These variations can be described in linguistic as well as statistical terms (McMenamin 2002). Concepts and themes (Leximancer 2008; Weber 1990) can be determined from variations within the norm, while the use of non-dictionary words, or slang, is an example of deviation from a norm.
Blogs have substantially reduced the technical and language skills required to publish. They have brought forward a wide variety of reporting techniques, content types, styles, and blogging goals. Bloggers generally express their thoughts in an informal, unreserved, and unorganized manner. The language of blogs mixes spoken and written constructs, such as jargon, abbreviations, excessive exclamation marks, short sentences, and emoticons. Topics that were once considered private are openly discussed by teenagers and young adults (Mishne 2006).
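The two features named in the abstract, the share of non-dictionary (slang) words and the average sentence length, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer, the naive sentence splitter, the toy dictionary `TOY_DICTIONARY`, and all function names are assumptions made for illustration, and a real study would use a full lexicon and a more careful sentence segmenter.

```python
import re

# Hypothetical toy lexicon (assumption); a real study would load a full dictionary.
TOY_DICTIONARY = {
    "i", "am", "so", "happy", "today", "we", "went",
    "to", "the", "park", "it", "was", "great",
}


def tokenize(text):
    """Lowercase the text and return runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_sentences(text):
    """Naive splitter (assumption): break on runs of '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def average_sentence_length(text):
    """Mean number of tokens per sentence; 0.0 for empty text."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def non_dictionary_ratio(text, dictionary):
    """Fraction of tokens absent from the dictionary; 0.0 for empty text."""
    words = tokenize(text)
    if not words:
        return 0.0
    unknown = [w for w in words if w not in dictionary]
    return len(unknown) / len(words)


def extract_features(text, dictionary=TOY_DICTIONARY):
    """Bundle both features into one dict, ready to feed a classifier."""
    return {
        "avg_sentence_length": average_sentence_length(text),
        "non_dictionary_ratio": non_dictionary_ratio(text, dictionary),
    }
```

Calling `extract_features` on a blog post returns one value per feature. In a pipeline like the one the abstract describes, these values would be appended to the features taken from earlier studies before a classifier is trained.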
Similar resources
Learning Age and Gender Using Co-occurrence of Non-dictionary Words from Stylistic Variations
This work attempts to report the stylistic differences in blogging for gender and age group variations using slang word co-occurrences. We have mainly focused on co-occurrence of non-dictionary words across bloggers of different gender and age groups. For this analysis, we have focused on the use of slang words as a feature to study the stylistic variations of bloggers across various age groups and ...
Effects of Age and Gender on Blogging
Analysis of a corpus of tens of thousands of blogs – incorporating close to 300 million words – indicates significant differences in writing style and content between male and female bloggers as well as among authors of different ages. Such differences can be exploited to determine an unknown author’s age and gender on the basis of a blog’s vocabulary.
Learning Age and Gender of Blogger from Stylistic Variation
We report results of stylistic differences in blogging for gender and age group variation. The results are based on two mutually independent features. The first feature is the use of slang words which is a new concept proposed by us for Stylistic study of bloggers. For the second feature, we have analyzed the variation in average length of sentences across various age groups and gender. These features ar...
Stylometric Analysis of Parliamentary Speeches: Gender Dimension
The relation between gender and language has been studied by many authors; however, there is still some uncertainty regarding the influence of gender on language usage in the professional environment. Often, the studied data sets are too small or the texts of individual authors are too short to capture differences in language usage with respect to gender successfully. This study draws from a larger corpus o...
Automatic Estimation of Bloggers' Gender
We propose an approach employing a Support Vector Machine (SVM) to estimate bloggers’ gender from blog posts. The data we analyze consists of blog posts on Doblog (a Japanese blog-hosting service) and questionnaire results from Doblog users. Experimental evaluations show that our approach achieved 90% accuracy for 83% of bloggers.
Journal title:
Volume / Issue:
Pages: -
Publication date: 2009